Analytics-Driven Lossless Data Compression for Rapid In-situ Indexing, Storing, and Querying

Authors

  • John Jenkins
  • Isha Arkatkar
  • Sriram Lakshminarasimhan
  • Neil Shah
  • Eric R. Schendel
  • Stéphane Ethier
  • Choong-Seock Chang
  • Jacqueline Chen
  • Hemanth Kolla
  • Scott Klasky
  • Robert B. Ross
  • Nagiza F. Samatova
Abstract

The analysis of scientific simulations is highly data-intensive and is becoming an increasingly important challenge. Peta-scale data sets require the use of light-weight query-driven analysis methods, as opposed to heavy-weight schemes that optimize for speed at the expense of size. This paper is an attempt in the direction of query processing over losslessly compressed scientific data. We propose a co-designed double-precision compression and indexing methodology for range queries by performing unique-value-based binning on the most significant bytes of double precision data (sign, exponent, and most significant mantissa bits), and inverting the resulting metadata to produce an inverted index over a reduced data representation. Without the inverted index, our method matches or improves compression ratios over both general-purpose and floating-point compression utilities. The inverted index is light-weight, and the overall storage requirement for both reduced column and index is less than 135%, whereas existing DBMS technologies can require 200-400%. As a proof-of-concept, we evaluate univariate range queries that additionally return column values, a critical component of data analytics, against state-of-the-art bitmap indexing technology, showing multi-fold query performance improvements.
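The binning and inverted-index idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of 2 high-order bytes (`HIGH_BYTES`), the function names, and the in-memory dictionary layout are all assumptions made for illustration, the input is assumed to contain only finite values, and no actual compression of the low-order bytes is performed.

```python
import struct
from collections import defaultdict

# Assumption: the abstract does not say how many high-order bytes are used.
# Two bytes cover the sign, the 11 exponent bits, and the top 4 mantissa bits.
HIGH_BYTES = 2
LOW_BYTES = 8 - HIGH_BYTES


def split_double(x):
    """Split the big-endian IEEE 754 encoding of x into (high, low) byte strings."""
    raw = struct.pack(">d", x)
    return raw[:HIGH_BYTES], raw[HIGH_BYTES:]


def join_double(high, low):
    """Rebuild the exact double from its high and low byte strings."""
    return struct.unpack(">d", high + low)[0]


def build_index(values):
    """Bin values by their unique high-order bytes.

    Each bin stores the record IDs (the inverted index) and the low-order
    bytes of its members; the high-order bytes are stored once, as the key.
    """
    bins = defaultdict(lambda: ([], []))
    for rid, x in enumerate(values):
        high, low = split_double(x)
        ids, lows = bins[high]
        ids.append(rid)
        lows.append(low)
    return dict(bins)


def range_query(index, lo, hi):
    """Return sorted (record_id, value) pairs with lo <= value < hi.

    Assumes all indexed values are finite.
    """
    hits = []
    for high, (ids, lows) in index.items():
        # The two extreme values that share this bin's high-order bytes.
        a = join_double(high, b"\x00" * LOW_BYTES)
        b = join_double(high, b"\xff" * LOW_BYTES)
        bin_min, bin_max = min(a, b), max(a, b)
        if bin_max < lo or bin_min >= hi:
            continue  # the whole bin lies outside the query range
        fully_inside = lo <= bin_min and bin_max < hi
        for rid, low in zip(ids, lows):
            x = join_double(high, low)
            # Only boundary bins need a per-value comparison.
            if fully_inside or lo <= x < hi:
                hits.append((rid, x))
    return sorted(hits)


values = [1.5, -2.25, 1.75, 100.0, 0.1, 3.0]
index = build_index(values)
print(range_query(index, 1.0, 4.0))  # [(0, 1.5), (2, 1.75), (5, 3.0)]
```

In this sketch, bins that fall entirely outside the query range are skipped without touching their low-order bytes, and values are reconstructed bit-for-bit, which is what makes the representation lossless.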

Similar resources

ALACRI2TY: Lossless Data Compression for Analytics-driven Query Processing

ARKATKAR, ISHA. ALACRI2TY: Lossless Data Compression for Analytics-driven Query Processing. (Under the direction of Nagiza F. Samatova.) Analysis of scientific simulations is highly data-intensive and is becoming an increasingly important challenge. Peta-scale data sets require us to look for alternative ways of performing query-driven analyses. This thesis is an attempt in the direction of que...

ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying

High-performance computing architectures face nontrivial data processing challenges, as computational and I/O components further diverge in performance trajectories. For scientific data analysis in particular, methods based on generating heavyweight access acceleration structures, e.g. indexes, are becoming less feasible for ever-increasing dataset sizes. We present ALACRITY, demonstrating the ...

AMR-aware in situ indexing and scalable querying

Spring Simulation Multi-Conference 2016, April 3-6, Pasadena, CA, USA. ©2016 Society for Modeling & Simulation International (SCS). Abstract: Query-driven analytics on scientific datasets is one of the fundamental approaches for scientific discoveries. Existing studies have explored query-driven analytics on uniform resolution meshes. However, querying on adaptive mesh refinement (AMR) data has not b...

An Efficient Technique for Text Compression

Storing a word or an entire text segment can require a large amount of storage space. Typically, a character requires 1 byte of memory. Compression is therefore very important for data management. When reducing the memory requirement of text data, lossless compression is needed. We suggest a lossless compression method for text data. Th...

Lossless Microarray Image Compression by Hardware Array Compactor

Microarray technology is a new and powerful tool for concurrently monitoring the expression of a large number of genes. Each microarray experiment produces hundreds of images, and each digital image requires a large amount of storage space. Hence, real-time processing and transmission of these images necessitate efficient, custom-made lossless compression schemes. In this paper, we offer a new archi...

Publication date: 2012